1.- DEFINIR OBJETIVOS. Los objetivos de la práctica son aplicar las técnicas de análisis exploratorio de datos (EDA) para analizar los patrones presentes en un conjunto de datos relacionados con solicitudes de préstamos. Realizaremos un análisis EDA para analizar los patrones presentes en los datos, asegurando que los solicitantes capaces de pagar el préstamo no sean rechazados. Se identificarán patrones que indiquen si un cliente puede tener dificultades para devolver el préstamo, lo que puede ser utilizado para tomar acciones. Para esto cogeremos la variable objetivo "TARGET".

El trabajo se va a dividir en los mismos notebooks que los HTML de guía y se tratará desde el notebook 01_Exploración general, donde se realizan los primeros pasos del EDA como la carga del dataframe, tratamiento de variables a futuro, dimensiones del dataset..., hasta la parte de modelo (no incluida) del notebook 03_encoding_categorical_scaled_vars.

En cuanto al problema a resolver, sera poder predecir la probabilidad de que un cliente tenga dificultades para pagar un préstamo. Es un problema de clasificación binaria, donde las clases son: 0: El cliente no tuvo dificultades para pagar el préstamo. 1: El cliente tuvo dificultades para pagar el préstamo. La meta principal es desarrollar un modelo que permita anticipar clientes con alto riesgo de incumplimiento de pagos.

Las variables a utilizar las iremos viendo a través del análisis del dataset, algunas interesantes a priori podrían ser el tipo de préstamo, el trabajo que desarrollan los clientes...

En primer lugar, cargamos todas las librerías y ponemos el máximo de columnas y filas que va a mostrar asi como el ancho a mostrar del dataframe

In [50]:
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
import plotly.express as px
import plotly.io as pio
import sys
import os
pio.renderers.default = 'notebook'

pd.set_option('display.max_columns', 10000)
pd.set_option('display.max_rows', 10000)
pd.set_option('display.width', 10000)
In [51]:
sys.path.append(os.path.abspath('../src'))
import f_aux

Importo el diccionario de datos

In [52]:
dict_path = '../data/columns_description.xlsx'
dictionary = pd.read_excel(dict_path)
In [53]:
dictionary.head(1)
Out[53]:
Unnamed: 0 Table Row Description Special
0 1 application_data SK_ID_CURR ID of loan in our sample NaN

Veo que tiene una columna creada tras la lectura del mismo como excel, procedo a eliminarla

In [54]:
dictionary = dictionary.drop(columns=['Unnamed: 0'])

Ahora se puede ver el total de columnas y que significa cada una de ellas

In [55]:
dictionary.head(122)
Out[55]:
Table Row Description Special
0 application_data SK_ID_CURR ID of loan in our sample NaN
1 application_data TARGET Target variable (1 - client with payment diffi... NaN
2 application_data NAME_CONTRACT_TYPE Identification if loan is cash or revolving NaN
3 application_data CODE_GENDER Gender of the client NaN
4 application_data FLAG_OWN_CAR Flag if the client owns a car NaN
5 application_data FLAG_OWN_REALTY Flag if client owns a house or flat NaN
6 application_data CNT_CHILDREN Number of children the client has NaN
7 application_data AMT_INCOME_TOTAL Income of the client NaN
8 application_data AMT_CREDIT Credit amount of the loan NaN
9 application_data AMT_ANNUITY Loan annuity NaN
10 application_data AMT_GOODS_PRICE For consumer loans it is the price of the good... NaN
11 application_data NAME_TYPE_SUITE Who was accompanying client when he was applyi... NaN
12 application_data NAME_INCOME_TYPE Clients income type (businessman, working, mat... NaN
13 application_data NAME_EDUCATION_TYPE Level of highest education the client achieved NaN
14 application_data NAME_FAMILY_STATUS Family status of the client NaN
15 application_data NAME_HOUSING_TYPE What is the housing situation of the client (r... NaN
16 application_data REGION_POPULATION_RELATIVE Normalized population of region where client l... normalized
17 application_data DAYS_BIRTH Client's age in days at the time of application time only relative to the application
18 application_data DAYS_EMPLOYED How many days before the application the perso... time only relative to the application
19 application_data DAYS_REGISTRATION How many days before the application did clien... time only relative to the application
20 application_data DAYS_ID_PUBLISH How many days before the application did clien... time only relative to the application
21 application_data OWN_CAR_AGE Age of client's car NaN
22 application_data FLAG_MOBIL Did client provide mobile phone (1=YES, 0=NO) NaN
23 application_data FLAG_EMP_PHONE Did client provide work phone (1=YES, 0=NO) NaN
24 application_data FLAG_WORK_PHONE Did client provide home phone (1=YES, 0=NO) NaN
25 application_data FLAG_CONT_MOBILE Was mobile phone reachable (1=YES, 0=NO) NaN
26 application_data FLAG_PHONE Did client provide home phone (1=YES, 0=NO) NaN
27 application_data FLAG_EMAIL Did client provide email (1=YES, 0=NO) NaN
28 application_data OCCUPATION_TYPE What kind of occupation does the client have NaN
29 application_data CNT_FAM_MEMBERS How many family members does client have NaN
30 application_data REGION_RATING_CLIENT Our rating of the region where client lives (1... NaN
31 application_data REGION_RATING_CLIENT_W_CITY Our rating of the region where client lives wi... NaN
32 application_data WEEKDAY_APPR_PROCESS_START On which day of the week did the client apply ... NaN
33 application_data HOUR_APPR_PROCESS_START Approximately at what hour did the client appl... rounded
34 application_data REG_REGION_NOT_LIVE_REGION Flag if client's permanent address does not ma... NaN
35 application_data REG_REGION_NOT_WORK_REGION Flag if client's permanent address does not ma... NaN
36 application_data LIVE_REGION_NOT_WORK_REGION Flag if client's contact address does not matc... NaN
37 application_data REG_CITY_NOT_LIVE_CITY Flag if client's permanent address does not ma... NaN
38 application_data REG_CITY_NOT_WORK_CITY Flag if client's permanent address does not ma... NaN
39 application_data LIVE_CITY_NOT_WORK_CITY Flag if client's contact address does not matc... NaN
40 application_data ORGANIZATION_TYPE Type of organization where client works NaN
41 application_data EXT_SOURCE_1 Normalized score from external data source normalized
42 application_data EXT_SOURCE_2 Normalized score from external data source normalized
43 application_data EXT_SOURCE_3 Normalized score from external data source normalized
44 application_data APARTMENTS_AVG Normalized information about building where th... normalized
45 application_data BASEMENTAREA_AVG Normalized information about building where th... normalized
46 application_data YEARS_BEGINEXPLUATATION_AVG Normalized information about building where th... normalized
47 application_data YEARS_BUILD_AVG Normalized information about building where th... normalized
48 application_data COMMONAREA_AVG Normalized information about building where th... normalized
49 application_data ELEVATORS_AVG Normalized information about building where th... normalized
50 application_data ENTRANCES_AVG Normalized information about building where th... normalized
51 application_data FLOORSMAX_AVG Normalized information about building where th... normalized
52 application_data FLOORSMIN_AVG Normalized information about building where th... normalized
53 application_data LANDAREA_AVG Normalized information about building where th... normalized
54 application_data LIVINGAPARTMENTS_AVG Normalized information about building where th... normalized
55 application_data LIVINGAREA_AVG Normalized information about building where th... normalized
56 application_data NONLIVINGAPARTMENTS_AVG Normalized information about building where th... normalized
57 application_data NONLIVINGAREA_AVG Normalized information about building where th... normalized
58 application_data APARTMENTS_MODE Normalized information about building where th... normalized
59 application_data BASEMENTAREA_MODE Normalized information about building where th... normalized
60 application_data YEARS_BEGINEXPLUATATION_MODE Normalized information about building where th... normalized
61 application_data YEARS_BUILD_MODE Normalized information about building where th... normalized
62 application_data COMMONAREA_MODE Normalized information about building where th... normalized
63 application_data ELEVATORS_MODE Normalized information about building where th... normalized
64 application_data ENTRANCES_MODE Normalized information about building where th... normalized
65 application_data FLOORSMAX_MODE Normalized information about building where th... normalized
66 application_data FLOORSMIN_MODE Normalized information about building where th... normalized
67 application_data LANDAREA_MODE Normalized information about building where th... normalized
68 application_data LIVINGAPARTMENTS_MODE Normalized information about building where th... normalized
69 application_data LIVINGAREA_MODE Normalized information about building where th... normalized
70 application_data NONLIVINGAPARTMENTS_MODE Normalized information about building where th... normalized
71 application_data NONLIVINGAREA_MODE Normalized information about building where th... normalized
72 application_data APARTMENTS_MEDI Normalized information about building where th... normalized
73 application_data BASEMENTAREA_MEDI Normalized information about building where th... normalized
74 application_data YEARS_BEGINEXPLUATATION_MEDI Normalized information about building where th... normalized
75 application_data YEARS_BUILD_MEDI Normalized information about building where th... normalized
76 application_data COMMONAREA_MEDI Normalized information about building where th... normalized
77 application_data ELEVATORS_MEDI Normalized information about building where th... normalized
78 application_data ENTRANCES_MEDI Normalized information about building where th... normalized
79 application_data FLOORSMAX_MEDI Normalized information about building where th... normalized
80 application_data FLOORSMIN_MEDI Normalized information about building where th... normalized
81 application_data LANDAREA_MEDI Normalized information about building where th... normalized
82 application_data LIVINGAPARTMENTS_MEDI Normalized information about building where th... normalized
83 application_data LIVINGAREA_MEDI Normalized information about building where th... normalized
84 application_data NONLIVINGAPARTMENTS_MEDI Normalized information about building where th... normalized
85 application_data NONLIVINGAREA_MEDI Normalized information about building where th... normalized
86 application_data FONDKAPREMONT_MODE Normalized information about building where th... normalized
87 application_data HOUSETYPE_MODE Normalized information about building where th... normalized
88 application_data TOTALAREA_MODE Normalized information about building where th... normalized
89 application_data WALLSMATERIAL_MODE Normalized information about building where th... normalized
90 application_data EMERGENCYSTATE_MODE Normalized information about building where th... normalized
91 application_data OBS_30_CNT_SOCIAL_CIRCLE How many observation of client's social surrou... NaN
92 application_data DEF_30_CNT_SOCIAL_CIRCLE How many observation of client's social surrou... NaN
93 application_data OBS_60_CNT_SOCIAL_CIRCLE How many observation of client's social surrou... NaN
94 application_data DEF_60_CNT_SOCIAL_CIRCLE How many observation of client's social surrou... NaN
95 application_data DAYS_LAST_PHONE_CHANGE How many days before application did client ch... NaN
96 application_data FLAG_DOCUMENT_2 Did client provide document 2 NaN
97 application_data FLAG_DOCUMENT_3 Did client provide document 3 NaN
98 application_data FLAG_DOCUMENT_4 Did client provide document 4 NaN
99 application_data FLAG_DOCUMENT_5 Did client provide document 5 NaN
100 application_data FLAG_DOCUMENT_6 Did client provide document 6 NaN
101 application_data FLAG_DOCUMENT_7 Did client provide document 7 NaN
102 application_data FLAG_DOCUMENT_8 Did client provide document 8 NaN
103 application_data FLAG_DOCUMENT_9 Did client provide document 9 NaN
104 application_data FLAG_DOCUMENT_10 Did client provide document 10 NaN
105 application_data FLAG_DOCUMENT_11 Did client provide document 11 NaN
106 application_data FLAG_DOCUMENT_12 Did client provide document 12 NaN
107 application_data FLAG_DOCUMENT_13 Did client provide document 13 NaN
108 application_data FLAG_DOCUMENT_14 Did client provide document 14 NaN
109 application_data FLAG_DOCUMENT_15 Did client provide document 15 NaN
110 application_data FLAG_DOCUMENT_16 Did client provide document 16 NaN
111 application_data FLAG_DOCUMENT_17 Did client provide document 17 NaN
112 application_data FLAG_DOCUMENT_18 Did client provide document 18 NaN
113 application_data FLAG_DOCUMENT_19 Did client provide document 19 NaN
114 application_data FLAG_DOCUMENT_20 Did client provide document 20 NaN
115 application_data FLAG_DOCUMENT_21 Did client provide document 21 NaN
116 application_data AMT_REQ_CREDIT_BUREAU_HOUR Number of enquiries to Credit Bureau about the... NaN
117 application_data AMT_REQ_CREDIT_BUREAU_DAY Number of enquiries to Credit Bureau about the... NaN
118 application_data AMT_REQ_CREDIT_BUREAU_WEEK Number of enquiries to Credit Bureau about the... NaN
119 application_data AMT_REQ_CREDIT_BUREAU_MON Number of enquiries to Credit Bureau about the... NaN
120 application_data AMT_REQ_CREDIT_BUREAU_QRT Number of enquiries to Credit Bureau about the... NaN
121 application_data AMT_REQ_CREDIT_BUREAU_YEAR Number of enquiries to Credit Bureau about the... NaN

Vamos a trabajar con 122 variables en el dataset

Creamos una ruta relativa que cogerá los elementos .csv de la carpeta data y leemos el dataset. De esta manera, si alguien descarga en su ordenador el archivo .zip debería ser capaz de leer los .csv de la carpeta data. En este caso nos han pedido que el .zip vaya sin datos para no saturar la nube.

In [56]:
path_folder = "../data/"
pd_loan = pd.read_csv(path_folder +"application_data.csv", low_memory=False)
pd_loan.head()
Out[56]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL OCCUPATION_TYPE CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY ORGANIZATION_TYPE EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE FLOORSMIN_MODE LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI FONDKAPREMONT_MODE HOUSETYPE_MODE TOTALAREA_MODE WALLSMATERIAL_MODE EMERGENCYSTATE_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
0 100002 1 Cash loans M N Y 0 202500.0 406597.5 24700.5 351000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.018801 -9461 -637 -3648.0 -2120 NaN 1 1 0 1 1 0 Laborers 1.0 2 2 WEDNESDAY 10 0 0 0 0 0 0 Business Entity Type 3 0.083037 0.262949 0.139376 0.0247 0.0369 0.9722 0.6192 0.0143 0.00 0.0690 0.0833 0.1250 0.0369 0.0202 0.0190 0.0000 0.0000 0.0252 0.0383 0.9722 0.6341 0.0144 0.0000 0.0690 0.0833 0.1250 0.0377 0.022 0.0198 0.0 0.0 0.0250 0.0369 0.9722 0.6243 0.0144 0.00 0.0690 0.0833 0.1250 0.0375 0.0205 0.0193 0.0000 0.00 reg oper account block of flats 0.0149 Stone, brick No 2.0 2.0 2.0 2.0 -1134.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0
1 100003 0 Cash loans F N N 0 270000.0 1293502.5 35698.5 1129500.0 Family State servant Higher education Married House / apartment 0.003541 -16765 -1188 -1186.0 -291 NaN 1 1 0 1 1 0 Core staff 2.0 1 1 MONDAY 11 0 0 0 0 0 0 School 0.311267 0.622246 NaN 0.0959 0.0529 0.9851 0.7960 0.0605 0.08 0.0345 0.2917 0.3333 0.0130 0.0773 0.0549 0.0039 0.0098 0.0924 0.0538 0.9851 0.8040 0.0497 0.0806 0.0345 0.2917 0.3333 0.0128 0.079 0.0554 0.0 0.0 0.0968 0.0529 0.9851 0.7987 0.0608 0.08 0.0345 0.2917 0.3333 0.0132 0.0787 0.0558 0.0039 0.01 reg oper account block of flats 0.0714 Block No 1.0 0.0 1.0 0.0 -828.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
2 100004 0 Revolving loans M Y Y 0 67500.0 135000.0 6750.0 135000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.010032 -19046 -225 -4260.0 -2531 26.0 1 1 1 1 1 0 Laborers 1.0 2 2 MONDAY 9 0 0 0 0 0 0 Government NaN 0.555912 0.729567 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 0.0 0.0 -815.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
3 100006 0 Cash loans F N Y 0 135000.0 312682.5 29686.5 297000.0 Unaccompanied Working Secondary / secondary special Civil marriage House / apartment 0.008019 -19005 -3039 -9833.0 -2437 NaN 1 1 0 1 0 0 Laborers 2.0 2 2 WEDNESDAY 17 0 0 0 0 0 0 Business Entity Type 3 NaN 0.650442 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2.0 0.0 2.0 0.0 -617.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN
4 100007 0 Cash loans M N Y 0 121500.0 513000.0 21865.5 513000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.028663 -19932 -3038 -4311.0 -3458 NaN 1 1 0 1 0 0 Core staff 1.0 2 2 THURSDAY 11 0 0 0 0 1 1 Religion NaN 0.322738 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 0.0 0.0 -1106.0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
  1. VARIABLES A FUTURO: Vemos que variables son "variables a futuro". Esto lo podemos observar en el excel adicional donde nos explica para que funciona cada columna y podemos recogerlas para despues eliminarlas, ya que todas las variables que recojan valores futuros no son interesantes para el análisis debido a que una persona NUEVA que pide un préstamos no tiene datos anteriores a la toma del préstamo. En este caso no hay variables a futuro observadas.
  1. VEMOS LAS DIMENSIONES DE LA TABLA
In [57]:
print(len(pd_loan.columns))
print(pd_loan.shape)
122
(307511, 122)

Encontramos que el dataframe tiene unas dimensiones de 307511 filas y 122 columnas, lo que nos dice a su vez que vamos a tratar con un dataset de 122 variables.

Buscamos la variable que hace únicos a todos los registros, funciona como el identificador de las filas. Vemos que esta es 'SK_ID_CURR', esta funciona, como su propio nombre indica, como identificador o ID, de ahi que salga que tiene tantos valores únicos como filas tiene nuestro dataset.

In [58]:
pd_loan.columns
pd_loan['SK_ID_CURR'].nunique()
Out[58]:
307511
  1. VER LOS TIPOS DE VARIABLES
In [59]:
pd_loan.dtypes.to_dict()
Out[59]:
{'SK_ID_CURR': dtype('int64'),
 'TARGET': dtype('int64'),
 'NAME_CONTRACT_TYPE': dtype('O'),
 'CODE_GENDER': dtype('O'),
 'FLAG_OWN_CAR': dtype('O'),
 'FLAG_OWN_REALTY': dtype('O'),
 'CNT_CHILDREN': dtype('int64'),
 'AMT_INCOME_TOTAL': dtype('float64'),
 'AMT_CREDIT': dtype('float64'),
 'AMT_ANNUITY': dtype('float64'),
 'AMT_GOODS_PRICE': dtype('float64'),
 'NAME_TYPE_SUITE': dtype('O'),
 'NAME_INCOME_TYPE': dtype('O'),
 'NAME_EDUCATION_TYPE': dtype('O'),
 'NAME_FAMILY_STATUS': dtype('O'),
 'NAME_HOUSING_TYPE': dtype('O'),
 'REGION_POPULATION_RELATIVE': dtype('float64'),
 'DAYS_BIRTH': dtype('int64'),
 'DAYS_EMPLOYED': dtype('int64'),
 'DAYS_REGISTRATION': dtype('float64'),
 'DAYS_ID_PUBLISH': dtype('int64'),
 'OWN_CAR_AGE': dtype('float64'),
 'FLAG_MOBIL': dtype('int64'),
 'FLAG_EMP_PHONE': dtype('int64'),
 'FLAG_WORK_PHONE': dtype('int64'),
 'FLAG_CONT_MOBILE': dtype('int64'),
 'FLAG_PHONE': dtype('int64'),
 'FLAG_EMAIL': dtype('int64'),
 'OCCUPATION_TYPE': dtype('O'),
 'CNT_FAM_MEMBERS': dtype('float64'),
 'REGION_RATING_CLIENT': dtype('int64'),
 'REGION_RATING_CLIENT_W_CITY': dtype('int64'),
 'WEEKDAY_APPR_PROCESS_START': dtype('O'),
 'HOUR_APPR_PROCESS_START': dtype('int64'),
 'REG_REGION_NOT_LIVE_REGION': dtype('int64'),
 'REG_REGION_NOT_WORK_REGION': dtype('int64'),
 'LIVE_REGION_NOT_WORK_REGION': dtype('int64'),
 'REG_CITY_NOT_LIVE_CITY': dtype('int64'),
 'REG_CITY_NOT_WORK_CITY': dtype('int64'),
 'LIVE_CITY_NOT_WORK_CITY': dtype('int64'),
 'ORGANIZATION_TYPE': dtype('O'),
 'EXT_SOURCE_1': dtype('float64'),
 'EXT_SOURCE_2': dtype('float64'),
 'EXT_SOURCE_3': dtype('float64'),
 'APARTMENTS_AVG': dtype('float64'),
 'BASEMENTAREA_AVG': dtype('float64'),
 'YEARS_BEGINEXPLUATATION_AVG': dtype('float64'),
 'YEARS_BUILD_AVG': dtype('float64'),
 'COMMONAREA_AVG': dtype('float64'),
 'ELEVATORS_AVG': dtype('float64'),
 'ENTRANCES_AVG': dtype('float64'),
 'FLOORSMAX_AVG': dtype('float64'),
 'FLOORSMIN_AVG': dtype('float64'),
 'LANDAREA_AVG': dtype('float64'),
 'LIVINGAPARTMENTS_AVG': dtype('float64'),
 'LIVINGAREA_AVG': dtype('float64'),
 'NONLIVINGAPARTMENTS_AVG': dtype('float64'),
 'NONLIVINGAREA_AVG': dtype('float64'),
 'APARTMENTS_MODE': dtype('float64'),
 'BASEMENTAREA_MODE': dtype('float64'),
 'YEARS_BEGINEXPLUATATION_MODE': dtype('float64'),
 'YEARS_BUILD_MODE': dtype('float64'),
 'COMMONAREA_MODE': dtype('float64'),
 'ELEVATORS_MODE': dtype('float64'),
 'ENTRANCES_MODE': dtype('float64'),
 'FLOORSMAX_MODE': dtype('float64'),
 'FLOORSMIN_MODE': dtype('float64'),
 'LANDAREA_MODE': dtype('float64'),
 'LIVINGAPARTMENTS_MODE': dtype('float64'),
 'LIVINGAREA_MODE': dtype('float64'),
 'NONLIVINGAPARTMENTS_MODE': dtype('float64'),
 'NONLIVINGAREA_MODE': dtype('float64'),
 'APARTMENTS_MEDI': dtype('float64'),
 'BASEMENTAREA_MEDI': dtype('float64'),
 'YEARS_BEGINEXPLUATATION_MEDI': dtype('float64'),
 'YEARS_BUILD_MEDI': dtype('float64'),
 'COMMONAREA_MEDI': dtype('float64'),
 'ELEVATORS_MEDI': dtype('float64'),
 'ENTRANCES_MEDI': dtype('float64'),
 'FLOORSMAX_MEDI': dtype('float64'),
 'FLOORSMIN_MEDI': dtype('float64'),
 'LANDAREA_MEDI': dtype('float64'),
 'LIVINGAPARTMENTS_MEDI': dtype('float64'),
 'LIVINGAREA_MEDI': dtype('float64'),
 'NONLIVINGAPARTMENTS_MEDI': dtype('float64'),
 'NONLIVINGAREA_MEDI': dtype('float64'),
 'FONDKAPREMONT_MODE': dtype('O'),
 'HOUSETYPE_MODE': dtype('O'),
 'TOTALAREA_MODE': dtype('float64'),
 'WALLSMATERIAL_MODE': dtype('O'),
 'EMERGENCYSTATE_MODE': dtype('O'),
 'OBS_30_CNT_SOCIAL_CIRCLE': dtype('float64'),
 'DEF_30_CNT_SOCIAL_CIRCLE': dtype('float64'),
 'OBS_60_CNT_SOCIAL_CIRCLE': dtype('float64'),
 'DEF_60_CNT_SOCIAL_CIRCLE': dtype('float64'),
 'DAYS_LAST_PHONE_CHANGE': dtype('float64'),
 'FLAG_DOCUMENT_2': dtype('int64'),
 'FLAG_DOCUMENT_3': dtype('int64'),
 'FLAG_DOCUMENT_4': dtype('int64'),
 'FLAG_DOCUMENT_5': dtype('int64'),
 'FLAG_DOCUMENT_6': dtype('int64'),
 'FLAG_DOCUMENT_7': dtype('int64'),
 'FLAG_DOCUMENT_8': dtype('int64'),
 'FLAG_DOCUMENT_9': dtype('int64'),
 'FLAG_DOCUMENT_10': dtype('int64'),
 'FLAG_DOCUMENT_11': dtype('int64'),
 'FLAG_DOCUMENT_12': dtype('int64'),
 'FLAG_DOCUMENT_13': dtype('int64'),
 'FLAG_DOCUMENT_14': dtype('int64'),
 'FLAG_DOCUMENT_15': dtype('int64'),
 'FLAG_DOCUMENT_16': dtype('int64'),
 'FLAG_DOCUMENT_17': dtype('int64'),
 'FLAG_DOCUMENT_18': dtype('int64'),
 'FLAG_DOCUMENT_19': dtype('int64'),
 'FLAG_DOCUMENT_20': dtype('int64'),
 'FLAG_DOCUMENT_21': dtype('int64'),
 'AMT_REQ_CREDIT_BUREAU_HOUR': dtype('float64'),
 'AMT_REQ_CREDIT_BUREAU_DAY': dtype('float64'),
 'AMT_REQ_CREDIT_BUREAU_WEEK': dtype('float64'),
 'AMT_REQ_CREDIT_BUREAU_MON': dtype('float64'),
 'AMT_REQ_CREDIT_BUREAU_QRT': dtype('float64'),
 'AMT_REQ_CREDIT_BUREAU_YEAR': dtype('float64')}

Aparentemente vamos a tratar con 3 tipos de variables, continuas ('float64', 'int64'), categóricas ('category') y booleanas ('0') que son aquellas que solo aceptan 2 tipos de valores, dichos valores suelen significar valores dicotomicos. EJ. FLAG_DOCUMENT_2 es una variable que recoge si el documento 2 fue entregado por el tomador del préstamo, esta variable solo toma el valor 0, que asumiremos que es entregado, y 1, que asumiremos que es no entregado. También se utilizaba antiguamente a menudo para variables de género siendo, por ejemplo, 0 = Masculino, 1 = Feminino.

In [60]:
#Returns us in categorical variables (From 'TARGET' to 'FLAG_DOCUMENT_21') and in Others, which do not fulfill the function ('SK_ID_CURR', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_ID_PUBLISH'). 
#The total number of categorical variables is 53 as it tells us at the end.
f_aux.dame_variables_categoricas(pd_loan)
Out[60]:
(['TARGET',
  'NAME_CONTRACT_TYPE',
  'CODE_GENDER',
  'FLAG_OWN_CAR',
  'FLAG_OWN_REALTY',
  'CNT_CHILDREN',
  'NAME_TYPE_SUITE',
  'NAME_INCOME_TYPE',
  'NAME_EDUCATION_TYPE',
  'NAME_FAMILY_STATUS',
  'NAME_HOUSING_TYPE',
  'FLAG_MOBIL',
  'FLAG_EMP_PHONE',
  'FLAG_WORK_PHONE',
  'FLAG_CONT_MOBILE',
  'FLAG_PHONE',
  'FLAG_EMAIL',
  'OCCUPATION_TYPE',
  'REGION_RATING_CLIENT',
  'REGION_RATING_CLIENT_W_CITY',
  'WEEKDAY_APPR_PROCESS_START',
  'HOUR_APPR_PROCESS_START',
  'REG_REGION_NOT_LIVE_REGION',
  'REG_REGION_NOT_WORK_REGION',
  'LIVE_REGION_NOT_WORK_REGION',
  'REG_CITY_NOT_LIVE_CITY',
  'REG_CITY_NOT_WORK_CITY',
  'LIVE_CITY_NOT_WORK_CITY',
  'ORGANIZATION_TYPE',
  'FONDKAPREMONT_MODE',
  'HOUSETYPE_MODE',
  'WALLSMATERIAL_MODE',
  'EMERGENCYSTATE_MODE',
  'FLAG_DOCUMENT_2',
  'FLAG_DOCUMENT_3',
  'FLAG_DOCUMENT_4',
  'FLAG_DOCUMENT_5',
  'FLAG_DOCUMENT_6',
  'FLAG_DOCUMENT_7',
  'FLAG_DOCUMENT_8',
  'FLAG_DOCUMENT_9',
  'FLAG_DOCUMENT_10',
  'FLAG_DOCUMENT_11',
  'FLAG_DOCUMENT_12',
  'FLAG_DOCUMENT_13',
  'FLAG_DOCUMENT_14',
  'FLAG_DOCUMENT_15',
  'FLAG_DOCUMENT_16',
  'FLAG_DOCUMENT_17',
  'FLAG_DOCUMENT_18',
  'FLAG_DOCUMENT_19',
  'FLAG_DOCUMENT_20',
  'FLAG_DOCUMENT_21'],
 ['SK_ID_CURR', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_ID_PUBLISH'],
 53)
In [61]:
#Returns us in continuous variables (From 'AMT_INCOME_TOTAL') and in Others, which do not fulfill the function (From 'DEF_60_CNT_SOCIAL_CIRCLE' to 'AMT_REQ_CREDIT_BUREAU_WEEK). 
#The total number of categorical variables is 61 as it tells us at the end.
f_aux.dame_variables_continuas(pd_loan)
Out[61]:
(['AMT_INCOME_TOTAL',
  'AMT_CREDIT',
  'AMT_ANNUITY',
  'AMT_GOODS_PRICE',
  'REGION_POPULATION_RELATIVE',
  'DAYS_REGISTRATION',
  'OWN_CAR_AGE',
  'CNT_FAM_MEMBERS',
  'EXT_SOURCE_1',
  'EXT_SOURCE_2',
  'EXT_SOURCE_3',
  'APARTMENTS_AVG',
  'BASEMENTAREA_AVG',
  'YEARS_BEGINEXPLUATATION_AVG',
  'YEARS_BUILD_AVG',
  'COMMONAREA_AVG',
  'ELEVATORS_AVG',
  'ENTRANCES_AVG',
  'FLOORSMAX_AVG',
  'FLOORSMIN_AVG',
  'LANDAREA_AVG',
  'LIVINGAPARTMENTS_AVG',
  'LIVINGAREA_AVG',
  'NONLIVINGAPARTMENTS_AVG',
  'NONLIVINGAREA_AVG',
  'APARTMENTS_MODE',
  'BASEMENTAREA_MODE',
  'YEARS_BEGINEXPLUATATION_MODE',
  'YEARS_BUILD_MODE',
  'COMMONAREA_MODE',
  'ELEVATORS_MODE',
  'ENTRANCES_MODE',
  'FLOORSMAX_MODE',
  'FLOORSMIN_MODE',
  'LANDAREA_MODE',
  'LIVINGAPARTMENTS_MODE',
  'LIVINGAREA_MODE',
  'NONLIVINGAPARTMENTS_MODE',
  'NONLIVINGAREA_MODE',
  'APARTMENTS_MEDI',
  'BASEMENTAREA_MEDI',
  'YEARS_BEGINEXPLUATATION_MEDI',
  'YEARS_BUILD_MEDI',
  'COMMONAREA_MEDI',
  'ELEVATORS_MEDI',
  'ENTRANCES_MEDI',
  'FLOORSMAX_MEDI',
  'FLOORSMIN_MEDI',
  'LANDAREA_MEDI',
  'LIVINGAPARTMENTS_MEDI',
  'LIVINGAREA_MEDI',
  'NONLIVINGAPARTMENTS_MEDI',
  'NONLIVINGAREA_MEDI',
  'TOTALAREA_MODE',
  'OBS_30_CNT_SOCIAL_CIRCLE',
  'DEF_30_CNT_SOCIAL_CIRCLE',
  'OBS_60_CNT_SOCIAL_CIRCLE',
  'DAYS_LAST_PHONE_CHANGE',
  'AMT_REQ_CREDIT_BUREAU_MON',
  'AMT_REQ_CREDIT_BUREAU_QRT',
  'AMT_REQ_CREDIT_BUREAU_YEAR'],
 ['DEF_60_CNT_SOCIAL_CIRCLE',
  'AMT_REQ_CREDIT_BUREAU_HOUR',
  'AMT_REQ_CREDIT_BUREAU_DAY',
  'AMT_REQ_CREDIT_BUREAU_WEEK'],
 61)

Comprobamos que todas las variables se estan tomando en cuenta de la siguiente forma, el resultado de dame_variables_categóricas es 53 variables categóricas y 4 others (57) y el resultado de dame_variables_continuas que es de 61 variables continuas y 4 others (65) el resultado es 65+57 = 122 variables. Las variables booleanas comentadas anteriormente se introducen en continuas ya que al estar conformadas por 0s y 1s los toma como una variable integer.

Estudiamos y analizamos la variable objetivo, que como hemos dicho anteriormente, es 'TARGET'

In [62]:
print(pd_loan['TARGET'].unique())
#We see how it only returns 0 and 1, in the additional excel we are told that 1= client with payment difficulties: has had a delay in payment of more than X days in at least one of the first Y installments of the loan. 
#in our sample and 0 = other cases (we assume no difficulties).
[1 0]
In [63]:
pd_plot_loan_status = pd_loan['TARGET']\
    .value_counts(normalize=True)\
    .mul(100)\
    .rename('percent')\
    .reset_index()

pd_plot_loan_status.columns = ['TARGET', 'percent']  # Rename the column 'index' to 'TARGET'.

pd_plot_loan_status_conteo = pd_loan['TARGET'].value_counts().reset_index()
pd_plot_loan_status_conteo.columns = ['TARGET', 'count'] # Rename the count column

pd_plot_loan_status_pc = pd.merge(pd_plot_loan_status, pd_plot_loan_status_conteo, on='TARGET', how='inner')

pd_plot_loan_status_pc
Out[63]:
TARGET percent count
0 0 91.927118 282686
1 1 8.072882 24825

Obtenemos por lo tanto que, de nuestra muestra, casi un 92% (282686) de las personas no tuvieron dificultades y aproximadamente un 8% (24825) si que las tuvieron.

In [64]:
fig = px.histogram(pd_plot_loan_status_pc, x="TARGET", y='percent', title='Clients distribution')
fig.show()

Gráficamos la distribución, en caso de tener más de dos valores en la variable como pudieran ser préstamos que aun no han finalizado u otras valores que no son interesantes en nuestro análisis se eliminarian y nos quedariamos con las que realmente necesitamos.

A continuación vemos si se eliminan valores missing y/o nulos y se crean las listas correspondientes a aquellas filas/columnas con nulos con las que trabajaremos de aquí en adelante

In [65]:
pd_series_null_columns = pd_loan.isnull().sum().sort_values(ascending=False)
pd_series_null_rows = pd_loan.isnull().sum(axis=1).sort_values(ascending=False)
print(pd_series_null_columns.shape, pd_series_null_rows.shape)

pd_null_columns = pd.DataFrame(pd_series_null_columns, columns=['columns_nuls'])     
pd_null_rows = pd.DataFrame(pd_series_null_rows, columns=['rows_nulls'])  
pd_null_rows['TARGET'] = pd_loan['TARGET'].copy()
pd_null_columns['columns_percentage'] = pd_null_columns['columns_nuls']/pd_loan.shape[0]
pd_null_rows['rows_percentage']= pd_null_rows['rows_nulls']/pd_loan.shape[1]
(122,) (307511,)

VEMOS QUE NÚMERO DE NULOS HAY POR COLUMNA Y COMO SE DISTRIBUYEN EN PORCENTAJE

In [66]:
pd_null_columns
Out[66]:
columns_nuls columns_percentage
COMMONAREA_MEDI 214865 0.698723
COMMONAREA_AVG 214865 0.698723
COMMONAREA_MODE 214865 0.698723
NONLIVINGAPARTMENTS_MODE 213514 0.694330
NONLIVINGAPARTMENTS_AVG 213514 0.694330
NONLIVINGAPARTMENTS_MEDI 213514 0.694330
FONDKAPREMONT_MODE 210295 0.683862
LIVINGAPARTMENTS_MODE 210199 0.683550
LIVINGAPARTMENTS_AVG 210199 0.683550
LIVINGAPARTMENTS_MEDI 210199 0.683550
FLOORSMIN_AVG 208642 0.678486
FLOORSMIN_MODE 208642 0.678486
FLOORSMIN_MEDI 208642 0.678486
YEARS_BUILD_MEDI 204488 0.664978
YEARS_BUILD_MODE 204488 0.664978
YEARS_BUILD_AVG 204488 0.664978
OWN_CAR_AGE 202929 0.659908
LANDAREA_MEDI 182590 0.593767
LANDAREA_MODE 182590 0.593767
LANDAREA_AVG 182590 0.593767
BASEMENTAREA_MEDI 179943 0.585160
BASEMENTAREA_AVG 179943 0.585160
BASEMENTAREA_MODE 179943 0.585160
EXT_SOURCE_1 173378 0.563811
NONLIVINGAREA_MODE 169682 0.551792
NONLIVINGAREA_AVG 169682 0.551792
NONLIVINGAREA_MEDI 169682 0.551792
ELEVATORS_MEDI 163891 0.532960
ELEVATORS_AVG 163891 0.532960
ELEVATORS_MODE 163891 0.532960
WALLSMATERIAL_MODE 156341 0.508408
APARTMENTS_MEDI 156061 0.507497
APARTMENTS_AVG 156061 0.507497
APARTMENTS_MODE 156061 0.507497
ENTRANCES_MEDI 154828 0.503488
ENTRANCES_AVG 154828 0.503488
ENTRANCES_MODE 154828 0.503488
LIVINGAREA_AVG 154350 0.501933
LIVINGAREA_MODE 154350 0.501933
LIVINGAREA_MEDI 154350 0.501933
HOUSETYPE_MODE 154297 0.501761
FLOORSMAX_MODE 153020 0.497608
FLOORSMAX_MEDI 153020 0.497608
FLOORSMAX_AVG 153020 0.497608
YEARS_BEGINEXPLUATATION_MODE 150007 0.487810
YEARS_BEGINEXPLUATATION_MEDI 150007 0.487810
YEARS_BEGINEXPLUATATION_AVG 150007 0.487810
TOTALAREA_MODE 148431 0.482685
EMERGENCYSTATE_MODE 145755 0.473983
OCCUPATION_TYPE 96391 0.313455
EXT_SOURCE_3 60965 0.198253
AMT_REQ_CREDIT_BUREAU_HOUR 41519 0.135016
AMT_REQ_CREDIT_BUREAU_DAY 41519 0.135016
AMT_REQ_CREDIT_BUREAU_WEEK 41519 0.135016
AMT_REQ_CREDIT_BUREAU_MON 41519 0.135016
AMT_REQ_CREDIT_BUREAU_QRT 41519 0.135016
AMT_REQ_CREDIT_BUREAU_YEAR 41519 0.135016
NAME_TYPE_SUITE 1292 0.004201
OBS_30_CNT_SOCIAL_CIRCLE 1021 0.003320
DEF_30_CNT_SOCIAL_CIRCLE 1021 0.003320
OBS_60_CNT_SOCIAL_CIRCLE 1021 0.003320
DEF_60_CNT_SOCIAL_CIRCLE 1021 0.003320
EXT_SOURCE_2 660 0.002146
AMT_GOODS_PRICE 278 0.000904
AMT_ANNUITY 12 0.000039
CNT_FAM_MEMBERS 2 0.000007
DAYS_LAST_PHONE_CHANGE 1 0.000003
CNT_CHILDREN 0 0.000000
FLAG_DOCUMENT_8 0 0.000000
NAME_CONTRACT_TYPE 0 0.000000
CODE_GENDER 0 0.000000
FLAG_OWN_CAR 0 0.000000
FLAG_DOCUMENT_2 0 0.000000
FLAG_DOCUMENT_3 0 0.000000
FLAG_DOCUMENT_4 0 0.000000
FLAG_DOCUMENT_5 0 0.000000
FLAG_DOCUMENT_6 0 0.000000
FLAG_DOCUMENT_7 0 0.000000
FLAG_DOCUMENT_9 0 0.000000
FLAG_DOCUMENT_21 0 0.000000
FLAG_DOCUMENT_10 0 0.000000
FLAG_DOCUMENT_11 0 0.000000
FLAG_OWN_REALTY 0 0.000000
FLAG_DOCUMENT_13 0 0.000000
FLAG_DOCUMENT_14 0 0.000000
FLAG_DOCUMENT_15 0 0.000000
FLAG_DOCUMENT_16 0 0.000000
FLAG_DOCUMENT_17 0 0.000000
FLAG_DOCUMENT_18 0 0.000000
FLAG_DOCUMENT_19 0 0.000000
FLAG_DOCUMENT_20 0 0.000000
FLAG_DOCUMENT_12 0 0.000000
AMT_CREDIT 0 0.000000
AMT_INCOME_TOTAL 0 0.000000
FLAG_PHONE 0 0.000000
LIVE_CITY_NOT_WORK_CITY 0 0.000000
REG_CITY_NOT_WORK_CITY 0 0.000000
TARGET 0 0.000000
REG_CITY_NOT_LIVE_CITY 0 0.000000
LIVE_REGION_NOT_WORK_REGION 0 0.000000
REG_REGION_NOT_WORK_REGION 0 0.000000
REG_REGION_NOT_LIVE_REGION 0 0.000000
HOUR_APPR_PROCESS_START 0 0.000000
WEEKDAY_APPR_PROCESS_START 0 0.000000
REGION_RATING_CLIENT_W_CITY 0 0.000000
REGION_RATING_CLIENT 0 0.000000
FLAG_EMAIL 0 0.000000
FLAG_CONT_MOBILE 0 0.000000
ORGANIZATION_TYPE 0 0.000000
FLAG_WORK_PHONE 0 0.000000
FLAG_EMP_PHONE 0 0.000000
FLAG_MOBIL 0 0.000000
DAYS_ID_PUBLISH 0 0.000000
DAYS_REGISTRATION 0 0.000000
DAYS_EMPLOYED 0 0.000000
DAYS_BIRTH 0 0.000000
REGION_POPULATION_RELATIVE 0 0.000000
NAME_HOUSING_TYPE 0 0.000000
NAME_FAMILY_STATUS 0 0.000000
NAME_EDUCATION_TYPE 0 0.000000
NAME_INCOME_TYPE 0 0.000000
SK_ID_CURR 0 0.000000

CREO UN DATA SET LLAMADO pd_loan_filter_null en el cual serán eliminadas todas aquellas variables con un % de nulos > al valor de threshold, en este caso 0.9, como ninguna variable supera dicho umbral, el dataset tiene la misma forma que el original con el que estaba trabajando. De esta manera podemos obtener un análisis en el que más variables entren en juego ya que con, por ejemplo, un threshold de 0.6 se eliminarían variables como OWN_CAR_AGE (Años de vida del coche), COMMONAREA_X (Información normalizada acerca de donde vive el cliente), cosas que, a criterio personal en esta práctica, pueden aportar valor al análisis.

In [67]:
threshold=0.9
list_vars_not_null = list(pd_null_columns[pd_null_columns['columns_percentage']<threshold].index)
pd_loan_filter_null = pd_loan.loc[:, list_vars_not_null]
pd_loan_filter_null.shape
Out[67]:
(307511, 122)

VARIABLES CATEGÓRICAS

In [68]:
list_var_cat, othercat,num_cat = f_aux.dame_variables_categoricas(dataset=pd_loan_filter_null)
pd_loan_filter_null[list_var_cat] = pd_loan_filter_null[list_var_cat].astype("category")
list_var_cat
Out[68]:
['FONDKAPREMONT_MODE',
 'WALLSMATERIAL_MODE',
 'HOUSETYPE_MODE',
 'EMERGENCYSTATE_MODE',
 'OCCUPATION_TYPE',
 'NAME_TYPE_SUITE',
 'CNT_CHILDREN',
 'FLAG_DOCUMENT_8',
 'NAME_CONTRACT_TYPE',
 'CODE_GENDER',
 'FLAG_OWN_CAR',
 'FLAG_DOCUMENT_2',
 'FLAG_DOCUMENT_3',
 'FLAG_DOCUMENT_4',
 'FLAG_DOCUMENT_5',
 'FLAG_DOCUMENT_6',
 'FLAG_DOCUMENT_7',
 'FLAG_DOCUMENT_9',
 'FLAG_DOCUMENT_21',
 'FLAG_DOCUMENT_10',
 'FLAG_DOCUMENT_11',
 'FLAG_OWN_REALTY',
 'FLAG_DOCUMENT_13',
 'FLAG_DOCUMENT_14',
 'FLAG_DOCUMENT_15',
 'FLAG_DOCUMENT_16',
 'FLAG_DOCUMENT_17',
 'FLAG_DOCUMENT_18',
 'FLAG_DOCUMENT_19',
 'FLAG_DOCUMENT_20',
 'FLAG_DOCUMENT_12',
 'FLAG_PHONE',
 'LIVE_CITY_NOT_WORK_CITY',
 'REG_CITY_NOT_WORK_CITY',
 'TARGET',
 'REG_CITY_NOT_LIVE_CITY',
 'LIVE_REGION_NOT_WORK_REGION',
 'REG_REGION_NOT_WORK_REGION',
 'REG_REGION_NOT_LIVE_REGION',
 'HOUR_APPR_PROCESS_START',
 'WEEKDAY_APPR_PROCESS_START',
 'REGION_RATING_CLIENT_W_CITY',
 'REGION_RATING_CLIENT',
 'FLAG_EMAIL',
 'FLAG_CONT_MOBILE',
 'ORGANIZATION_TYPE',
 'FLAG_WORK_PHONE',
 'FLAG_EMP_PHONE',
 'FLAG_MOBIL',
 'NAME_HOUSING_TYPE',
 'NAME_FAMILY_STATUS',
 'NAME_EDUCATION_TYPE',
 'NAME_INCOME_TYPE']

Comprobamos que, ciertamente, todas las variables de la lista 'list_cat_vars' son categóricas/'category'

In [69]:
unique_dtypes_cat = pd_loan_filter_null[list_var_cat].dtypes.apply(lambda x: x.name).unique()
print("Unique data types in list_var_cat:", unique_dtypes_cat)
Unique data types in list_var_cat: ['category']

VARIABLES NUMÉRICAS

In [70]:
list_var_cont, othercont, num_cont = f_aux.dame_variables_continuas(dataset=pd_loan_filter_null)
pd_loan_filter_null[list_var_cont] = pd_loan_filter_null[list_var_cont].astype("float64")
list_var_cont
Out[70]:
['COMMONAREA_MEDI',
 'COMMONAREA_AVG',
 'COMMONAREA_MODE',
 'NONLIVINGAPARTMENTS_MODE',
 'NONLIVINGAPARTMENTS_AVG',
 'NONLIVINGAPARTMENTS_MEDI',
 'LIVINGAPARTMENTS_MODE',
 'LIVINGAPARTMENTS_AVG',
 'LIVINGAPARTMENTS_MEDI',
 'FLOORSMIN_AVG',
 'FLOORSMIN_MODE',
 'FLOORSMIN_MEDI',
 'YEARS_BUILD_MEDI',
 'YEARS_BUILD_MODE',
 'YEARS_BUILD_AVG',
 'OWN_CAR_AGE',
 'LANDAREA_MEDI',
 'LANDAREA_MODE',
 'LANDAREA_AVG',
 'BASEMENTAREA_MEDI',
 'BASEMENTAREA_AVG',
 'BASEMENTAREA_MODE',
 'EXT_SOURCE_1',
 'NONLIVINGAREA_MODE',
 'NONLIVINGAREA_AVG',
 'NONLIVINGAREA_MEDI',
 'ELEVATORS_MEDI',
 'ELEVATORS_AVG',
 'ELEVATORS_MODE',
 'APARTMENTS_MEDI',
 'APARTMENTS_AVG',
 'APARTMENTS_MODE',
 'ENTRANCES_MEDI',
 'ENTRANCES_AVG',
 'ENTRANCES_MODE',
 'LIVINGAREA_AVG',
 'LIVINGAREA_MODE',
 'LIVINGAREA_MEDI',
 'FLOORSMAX_MODE',
 'FLOORSMAX_MEDI',
 'FLOORSMAX_AVG',
 'YEARS_BEGINEXPLUATATION_MODE',
 'YEARS_BEGINEXPLUATATION_MEDI',
 'YEARS_BEGINEXPLUATATION_AVG',
 'TOTALAREA_MODE',
 'EXT_SOURCE_3',
 'AMT_REQ_CREDIT_BUREAU_MON',
 'AMT_REQ_CREDIT_BUREAU_QRT',
 'AMT_REQ_CREDIT_BUREAU_YEAR',
 'OBS_30_CNT_SOCIAL_CIRCLE',
 'DEF_30_CNT_SOCIAL_CIRCLE',
 'OBS_60_CNT_SOCIAL_CIRCLE',
 'EXT_SOURCE_2',
 'AMT_GOODS_PRICE',
 'AMT_ANNUITY',
 'CNT_FAM_MEMBERS',
 'DAYS_LAST_PHONE_CHANGE',
 'AMT_CREDIT',
 'AMT_INCOME_TOTAL',
 'DAYS_REGISTRATION',
 'REGION_POPULATION_RELATIVE']
In [71]:
pd_loan_filter_null[list_var_cont].dtypes
unique_dtypes = set(pd_loan_filter_null[list_var_cont].dtypes)
print(unique_dtypes)
{dtype('float64')}

Al igual que con las variables categóricas, comprobamos que en la lista de variables continuas solo hay variables continuas tal que 'float64'/'int64', en este caso 'float64'

Al igual que en la plantilla HTML, voy a tratar algunas variables que pueden ser interesante conocer datos a priori de ellas

A través del siguiente código, vemos el tipo de contrato que tiene el cliente diferenciando entre CASH LOANS (PRÉSTAMOS EN EFECTIVO DE UNA CANTIDAD FIJA LA CUAL SE ENTREGA AL MOMENTO DE PEDIRLO) y REVOLVING LOANS (PRÉSTAMOS MÁS FLEXIBLES EN LOS QUE SE ABRE UNA LINEA DE CRÉDITO CON UN LIMITE PREESTABLECIDO DE LA CUAL SE PUEDE RETIRAR DINERO HASTA DICHO LÍMITE) y vemos como se distribuye en 90,48% aproximadamente pidieron un 'Cash Loan' y el 9,52% restante solicitaron 'Revolving loans'

In [72]:
pd_loan_filter_null['NAME_CONTRACT_TYPE'].value_counts()
Out[72]:
NAME_CONTRACT_TYPE
Cash loans         278232
Revolving loans     29279
Name: count, dtype: int64
In [73]:
fig = px.histogram(
    pd_loan_filter_null,
    x="NAME_CONTRACT_TYPE",
    title='Credit distribution by contract type',
    histnorm='percent',  # Convert counts to percentage
    labels={"NAME_CONTRACT_TYPE": "Contract type"} #AXIS X
)

#AXIS Y
fig.update_yaxes(title_text="Percentage of the poblation")

fig.show()

Podría ser interesante ver también el género de los tomadores del préstamo en este caso, casi 2/3 de la muestra son de género femenino y 1/3 masculino.

In [74]:
pd_loan_filter_null['CODE_GENDER'].value_counts()
Out[74]:
CODE_GENDER
F      202448
M      105059
XNA         4
Name: count, dtype: int64

A través del siguiente código vemos, en porcentaje, la fuente de ingresos de los clientes

In [75]:
print(pd_loan_filter_null["NAME_INCOME_TYPE"].value_counts().count())
pd_loan_filter_null["NAME_INCOME_TYPE"]\
    .apply(lambda x: str(x).lower().strip()).value_counts(normalize=True)#.count()
8
Out[75]:
NAME_INCOME_TYPE
working                 0.516320
commercial associate    0.232892
pensioner               0.180033
state servant           0.070576
unemployed              0.000072
student                 0.000059
businessman             0.000033
maternity leave         0.000016
Name: proportion, dtype: float64

Más del 50% se encuentran como 'Working' o trabajando, esta variable puede ser engañosa por qué realmente otros valores como commercial associate, state servant o businessman también contaría como 'working'. Podría ser interesante de cara al futuro simplemente dividir entre aquellos que trabajan (working, commercial,businessman,state servant) y aquellos que, entiendo, no trabajan (pensioner,unemployed,student,martenity leave) para un análisis entre poblacion activa y no activa.

A través de sumar la lista que nos ha devuelto los 'others' para los 2 tipos de variables obtenemos las 8 variables que no son ni categóricas ni continuas

In [76]:
other_variables = othercat + othercont
other_variables
Out[76]:
['DAYS_ID_PUBLISH',
 'DAYS_EMPLOYED',
 'DAYS_BIRTH',
 'SK_ID_CURR',
 'AMT_REQ_CREDIT_BUREAU_HOUR',
 'AMT_REQ_CREDIT_BUREAU_DAY',
 'AMT_REQ_CREDIT_BUREAU_WEEK',
 'DEF_60_CNT_SOCIAL_CIRCLE']

Guardamos el dataframe pd_loan_filter_null a .csv para poder continuar trabajando con el en otro notebook y no realizar notebooks excesivamente largos

In [77]:
pd_loan_filter_null.to_csv("../data/pd_data_initial_preprocessing.csv")